{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "from pandas import Series, DataFrame\n",
    "\n",
    "s = pd.read_csv('../data/taxi-distance.csv', header=None).squeeze()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Beyond 1\n",
    "\n",
    "Compare the mean and median trip distances. What does that tell you about the distribution of our data?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "count    9999.000000\n",
       "mean        3.158511\n",
       "std         4.037516\n",
       "min         0.000000\n",
       "25%         1.000000\n",
       "50%         1.700000\n",
       "75%         3.300000\n",
       "max        64.600000\n",
       "Name: 0, dtype: float64"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "s.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because the mean is significantly higher than the median, it would seem that there are some *very* long trips in our data set that are pulling the mean up. And sure enough, we see that the standard deviation is 4, but that we have at least one trip > 64 miles in length."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Beyond 2\n",
    "\n",
    "How many short, medium, and long trips were there for trips that had only one passenger? Note that data for passenger count and trip length are from the same data set, meaning that the indexes are the same."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0\n",
       "short     4333\n",
       "medium    2387\n",
       "long       487\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "passenger_count = pd.read_csv('../data/taxi-passenger-count.csv', header=None).squeeze()\n",
    "\n",
    "pd.cut(s[passenger_count == 1], \n",
    "       bins=[s.min(), 2, 10, s.max()], \n",
    "       include_lowest=True,\n",
    "       labels=['short', 'medium', 'long']).value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Beyond 3\n",
    "\n",
    "What happens if we don't pass explicit intervals, and instead ask `pd.cut` to just create 3 bins, with `bins=3`?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([-0.0646    , 21.53333333, 43.06666667, 64.6       ])"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "passenger_count = pd.read_csv('../data/taxi-passenger-count.csv', header=None).squeeze()\n",
    "\n",
    "pd.cut(s[passenger_count == 1], \n",
    "       bins=3,\n",
    "       labels=['short', 'medium', 'long'], retbins=True)[-1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0\n",
       "short     7179\n",
       "medium      26\n",
       "long         2\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.cut(s[passenger_count == 1], \n",
    "       bins=3,\n",
    "       labels=['short', 'medium', 'long']).value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`pd.cut` took the interval from `s.min()` to `s.max()`, divided it into three equal parts, and assigned those to be `short`, `medium`, and `long`. We can see, though, that this meant our `long` category is from 43 miles to 64.6 miles -- numerically one-third of the values' interval, but only including a handful of values!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
